
Unify SIMD arithmetic under a shared transform_binary template#740

Merged
yungyuc merged 2 commits into solvcon:master from
tigercosmos:issue646
May 6, 2026

Conversation


@tigercosmos tigercosmos commented Apr 19, 2026

Summary

Refs #646 (Task 2). The generic and NEON backends each had four near-identical loops for add / sub / mul / div. This PR collapses them into a single transform_binary per backend that takes the operation as an injected functor.

What changed

In the generic backend, the four ops now pass std::plus / std::minus / std::multiplies / std::divides into transform_binary. In the NEON backend they pass vec_add / vec_sub / vec_mul / vec_div wrappers around neon_alias, and std::invocable<VecOp, vec_t, vec_t> routes types without a matching vector overload (e.g. int64 for vmulq) to the scalar path at compile time. This replaces the ad-hoc vector_lane > 2 and is_floating_point_v guards.
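The description above implies a shape like the following on the generic side (a minimal sketch; actual modmesh signatures and namespaces may differ):

```cpp
#include <functional>

namespace generic
{

// Element-wise binary transform over [dest, dest_end); the operation
// is injected as a functor so add/sub/mul/div share one loop.
template <typename T, typename ScalarOp>
inline void transform_binary(T * dest, T const * dest_end,
                             T const * src1, T const * src2,
                             ScalarOp scalar_op)
{
    T * ptr = dest;
    while (ptr < dest_end)
    {
        *ptr = scalar_op(*src1, *src2);
        ++ptr;
        ++src1;
        ++src2;
    }
}

// Each public op collapses to a one-line wrapper.
template <typename T>
inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
{
    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
}

} // namespace generic
```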

Bugs fixed along the way

  • Sub-lane UB in NEON. ptr <= dest_end - N_lane formed a pointer before the buffer when the input was shorter than one SIMD lane. The vector loop now runs a counted trip count (blocks = (dest_end - dest) / N_lane), which is UB-safe on sub-lane inputs and lowers to a subs/b.ne macro-op-fused back-edge on AArch64. The scalar remainder is inline instead of a recursive call into generic::. The same rewrite is applied to check_between.
  • check_between diagnostic. The SIMD body checked the >= max mask first and only looked at < min if the first was empty, so a later too-large lane could hide an earlier too-small one. Both bounds are now inspected before picking the returned pointer.
  • has_vectype typing. Declared size_t; retyped to bool to match its predicate role.
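The counted-trip rewrite from the first bullet can be sketched as below. The vector body is replaced by a scalar stand-in and N_lane is hard-coded to 4 for illustration; the real code loads and stores through vld1q/vst1q and takes the lane width from type::vector_lane&lt;T&gt;:

```cpp
#include <cstddef>

template <typename T>
void transform_binary_counted(T * dest, T const * dest_end,
                              T const * src1, T const * src2)
{
    constexpr std::size_t N_lane = 4; // illustration only

    // Counted trip count: never forms dest_end - N_lane, so a buffer
    // shorter than one lane yields blocks == 0 and no out-of-bounds
    // pointer is ever materialized.
    std::size_t const blocks = static_cast<std::size_t>(dest_end - dest) / N_lane;
    for (std::size_t block = 0; block < blocks; ++block)
    {
        for (std::size_t i = 0; i < N_lane; ++i) // stand-in for the vld1q/vst1q body
        {
            dest[i] = src1[i] + src2[i];
        }
        dest += N_lane;
        src1 += N_lane;
        src2 += N_lane;
    }

    // Inline scalar remainder; no recursive call into generic::.
    while (dest < dest_end)
    {
        *dest = *src1 + *src2;
        ++dest;
        ++src1;
        ++src2;
    }
}
```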

Profiling

Verified on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1):

  • transform_binary inlines fully — no functor calls, no extra moves, no spills. Hot loop is 5 instr/iter with one fused control-chain macro-op (subs/b.ne), matching the hand-written master baseline byte-for-byte in shape.
  • Throughput on cache-resident buffers (n = 256 / 16384) is within ±2% of master across add/sub/mul/div × float/int32/double, well inside run-to-run noise. mul<int64> (scalar fallback) is byte-identical.
  • The same counted-trip form on check_between measures ~1.8× faster than master in an inlined hot scan, because the new loop shape avoids a register-shuffle the post-incrementing form was triggering under inlining pressure.

Tests

tests/test_buffer_simd.py pins _simd_feature() == "NEON" on aarch64 so a silent fallback to the scalar path cannot pass unnoticed. It then covers the int32 shape matrix (n = 1, 3, 4, 5, 8, 17) for transform_binary, the int64-mul SFINAE fallback, and float sub/mul/div with one block + tail. A new private modmesh._modmesh._simd_feature() binding exposes the runtime-detected backend.

Follow-up

simd::check_between has inconsistent bound semantics across paths: the NEON SIMD body treats value == max_val as out-of-range, while the scalar fallback accepts it. Out of scope here; left for a separate change.

Test plan

  • make gtest
  • tests/test_buffer_simd.py on aarch64
  • CI on Linux / macOS / aarch64

🤖 Generated with Claude Code

@tigercosmos tigercosmos marked this pull request as draft April 19, 2026 12:25
@tigercosmos tigercosmos force-pushed the issue646 branch 2 times, most recently from 488d5cd to 11a6eb1 Compare April 19, 2026 13:09
@tigercosmos tigercosmos changed the title Refactor SIMD to xsimd-style loop injection Unify SIMD arithmetic under a shared transform_binary template Apr 27, 2026
@tigercosmos tigercosmos marked this pull request as ready for review April 27, 2026 20:38
@tigercosmos tigercosmos left a comment

@yungyuc The PR is ready for review. Thanks!

// every correctness check. Kept under an underscore-prefixed name
// because detect_simd() only meaningfully reflects the dispatched
// backend on aarch64 today; on other targets it would mislead users.
mod.def("_simd_feature", &simd_feature_name);
Collaborator Author

For checking whether SIMD is working.

Member

cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.

Comment on lines +97 to +101
struct vec_add
{
    template <typename V>
    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
Collaborator Author

Key design in this PR.

constexpr size_t N_lane = type::vector_lane<T>;
if constexpr (!std::invocable<VecOp, vec_t, vec_t>)
{
generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
Collaborator Author

T does have a vector type, but the specific VecOp functor can't be called with it. For example, vdivq doesn't exist for integer vector types in NEON, so vec_div{} isn't invocable with int32x4_t.

{
vec_t v1 = vld1q(src1);
vec_t v2 = vld1q(src2);
vst1q(ptr, vec_op(v1, v2));
Collaborator Author

vec_op is called here.

if constexpr (!type::has_vectype<T>)
{
return generic::add<T>(dest, dest_end, src1, src2);
generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
Collaborator Author

The scalar type T itself has no corresponding NEON vector type (e.g., bool, int64_t). There's no vector register representation at all, so SIMD is impossible.
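A simplified stand-in for that trait (the real lane counts live in modmesh's detail::vector; only float and int32 are specialized here, so this is illustrative, not the actual table):

```cpp
#include <cstddef>
#include <cstdint>

// Lane count is zero for any type with no vector register form.
template <typename T>
struct vector_info { static constexpr std::size_t N_lane = 0; };
template <>
struct vector_info<float> { static constexpr std::size_t N_lane = 4; };
template <>
struct vector_info<std::int32_t> { static constexpr std::size_t N_lane = 4; };

// The predicate gating the SIMD path; note the bool type (the PR fixes
// an earlier size_t declaration).
template <typename T>
inline constexpr bool has_vectype = vector_info<T>::N_lane > 0;

static_assert(has_vectype<float>);
static_assert(!has_vectype<bool>); // no vector form: scalar path only
```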


#include <cstddef>
#include <arm_neon.h>
#include <cstddef>
Collaborator Author

The formatter fixes the order; I think it should be fine. Let me know if I should revert it.

Member

It's fine.


template <typename T>
inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
Collaborator Author

Fixed the boolean type.

Member

Good catch.

Comment on lines +70 to 73
 inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
 {
-    T * ptr = dest;
-    while (ptr < dest_end)
-    {
-        *ptr = *src1 + *src2;
-        ++ptr;
-        ++src1;
-        ++src2;
-    }
+    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
 }
Collaborator Author

Main design of this PR.

Member

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

Collaborator Author

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

Profiled on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1). Both the assembly and the throughput look clean.

Assembly. The transform_binary<T, ScalarOp, VecOp> template inlines fully — no functor calls, no extra moves, no spills. Hot-loop body for simd_add_f32:

; BEFORE (master)                     ; AFTER (this PR)
ldr     q0, [x2], #16                  ldr     q0, [x2], #16
ldr     q1, [x3], #16                  ldr     q1, [x3], #16
fadd.4s v0, v0, v1                     fadd.4s v0, v0, v1
str     q0, [x0], #16                  str     q0, [x0], #16
cmp     x0, x8                         subs    x8, x8, #1
b.ls    LBB0_1                         b.ne    LBB0_2

5 instr/iter on both sides; one fused control-chain macro-op each (cmp/b.ls before, subs/b.ne after; both are fusion-capable pairs). Same pattern for sub/mul/div/add<int32>/mul<int32>/add<double>. mul<int64> (scalar fallback via std::invocable<VecOp, vec_t, vec_t>) is byte-identical to master.

Throughput (Gelem/s, n=16384, L2-resident, median of 3 runs):

| Op | BEFORE | AFTER | Δ |
| --- | --- | --- | --- |
| add&lt;float&gt; | 14.91 | 14.78 | −0.9% |
| mul&lt;float&gt; | 14.82 | 14.77 | −0.4% |
| div&lt;float&gt; | 14.78 | 14.76 | −0.2% |
| add&lt;int32&gt; | 14.87 | 14.71 | −1.0% |
| add&lt;double&gt; | 5.63 | 5.69 | +1.2% |

All ops within ±2% of master across L1/L2-resident sizes — well inside run-to-run noise. Full write-up + reproducer in profiling/simd_pr740/.

Member

Profiled on Apple M3 Pro (clang 17, -O3 -DNDEBUG -mcpu=apple-m1). Both the assembly and the throughput look clean.

The profiling results and assembly look good. But clang 17 looks old. The latest version provided by Xcode is version 21.0.0 (clang-2100.0.123.102). An old compiler is OK since both before and after use the same version.

Comment thread tests/test_buffer_simd.py
Comment on lines +45 to +46
if platform.machine() in ("arm64", "aarch64"):
self.assertEqual(feature, "NEON")
Collaborator Author

This checks that NEON is actually in use, which we didn't test before.

Comment thread tests/test_buffer_simd.py
self.skipTest("_simd_feature() = " + feature)


class SimdTransformBinaryTC(unittest.TestCase):
Collaborator Author

Some cases for checking transform_binary functionality.

Collaborator

Why do you isolate this unit test out from test_buffer.py?

Collaborator Author

The whole SIMD implementation also lives outside the buffer directory. I think it's worth a new file.

@yungyuc

yungyuc commented Apr 27, 2026

@KHLee529 Could you please take a look?

@yungyuc yungyuc requested review from KHLee529 and yungyuc April 27, 2026 22:53
@yungyuc yungyuc added performance Profiling, runtime, and memory consumption array Multi-dimensional array implementation labels Apr 27, 2026
@yungyuc yungyuc moved this from Todo to In Progress in tensor operations Apr 27, 2026
@yungyuc yungyuc moved this to In Progress in tabular data processing Apr 27, 2026
@KHLee529
Collaborator

The unified backend looks nice at first glance. I'll dive into the details later.

@yungyuc yungyuc left a comment

  • Clarify if some for loops can also be replaced with while.
  • Run performance test to compare the runtime before and after the change. List the results to show that the change does not degrade runtime performance.
  • Rename test_simd.py to test_buffer_simd.py. We can discuss which name is better.


Comment thread cpp/modmesh/simd/neon/neon.hpp Outdated
// Vector loop runs while a full lane still fits. The remaining-count
// form keeps the condition valid for buffers shorter than one lane.
T const * ptr = start;
while (static_cast<size_t>(end - ptr) >= N_lane)
Member

while reads clearer than for.

Comment thread cpp/modmesh/simd/neon/neon.hpp Outdated
if (ptr != dest_end)

// Tail scalar loop for remaining elements
for (; ptr < end; ++ptr)
Member

Why not using while too?

Collaborator Author

done






Comment thread tests/test_buffer_simd.py
Member

Since most tests are against SimpleArray, I suggest to name the new test file as test_buffer_simd.py?

Collaborator Author

done.

@KHLee529 KHLee529 left a comment

No change requested. Only some comments and questions listed.

struct vec_add
{
    template <typename V>
    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
Collaborator

Can these operator helper functions also be inlined? Based on my experience profiling the speed of SimpleArray SIMD operations, whether the vector operations are inlined has a large impact on performance.

Collaborator Author

They are implicitly inlined, as confirmed by the profiling.

}
while (ptr < dest_end)
{
*ptr = scalar_op(*src1, *src2);
Collaborator

Nice way to remove the dependency on the generic functions.

{
T idx = *ptr;
if (idx < min_val || idx > max_val)
if (*ptr < min_val || *ptr > max_val)
Collaborator

Is this refinement potentially slower due to one more dereference?

Collaborator Author

No — clang CSEs the two textual *ptr reads into a single load per iteration. Same one ldr in both versions:

  ; BEFORE                              ; AFTER                
  ldr  w10, [x0], #4                    ldr  w10, [x0]         
  cmp  w10, w8                          cmp  w10, w8           
  ccmp w10, w9, #0, ge                  ccmp w10, w9, #0, ge 
  b.le ...                              b.gt ...               
                                        add  x0, x0, #4
                                        cmp  x0, x1            
                                        b.lo ...             

No extra dereference. In fact, when this function inlines into a hot caller the new form is measurably faster (~1.8× on a tight scan loop, M3 Pro): the for (...; ++ptr) shape decouples the load operand from the pointer bump, which gives the register allocator more freedom and avoids a redundant register shuffle that the post-incrementing ldr [x0], #4 triggers under inlining pressure.


Comment on lines +216 to +221
// Vector loop runs while a full lane still fits. Counted trip form
// for the same reason as transform_binary above: avoids UB on
// sub-lane inputs and the per-iter `sub` overhead.
size_t const blocks = static_cast<size_t>(end - start) / N_lane;
T const * ptr = start;
for (size_t block = 0; block < blocks; ++block)
Collaborator Author

An intermediate revision used while (dest_end - ptr >= N_lane) for the bound check. That form is UB-safe but forces clang to emit a non-flag-setting sub + cmp #12 pair, which breaks macro-op fusion on AArch64 and showed a real ~20–25% regression on cache-resident loops. The current head replaces it with a counted trip count (blocks = (dest_end - dest) / N_lane), which restores fusion. That is what the numbers above measure.
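The two shapes can be put side by side in a sketch (hypothetical helper names, N_lane fixed at 4). Both compute the same number of blocks; the difference is only in the condition the back-edge evaluates:

```cpp
#include <cstddef>

// Bound-check form: UB-safe, but `end - ptr >= N_lane` keeps a live
// pointer difference in the loop, which lowers to a non-flag-setting
// sub + cmp pair and defeats macro-op fusion on the back-edge.
template <typename T>
std::size_t blocks_bound_checked(T const * ptr, T const * end)
{
    constexpr std::size_t N_lane = 4;
    std::size_t blocks = 0;
    while (static_cast<std::size_t>(end - ptr) >= N_lane)
    {
        ++blocks;      // vector body would go here
        ptr += N_lane;
    }
    return blocks;
}

// Counted-trip form: one division up front, then a plain counter whose
// back-edge can fuse into subs/b.ne.
template <typename T>
std::size_t blocks_counted_trip(T const * ptr, T const * end)
{
    constexpr std::size_t N_lane = 4;
    std::size_t const blocks = static_cast<std::size_t>(end - ptr) / N_lane;
    for (std::size_t block = 0; block < blocks; ++block)
    {
        ptr += N_lane; // vector body would go here
    }
    return blocks;
}
```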

Member

I did not know that. Good to learn.

Refs solvcon#646 (Task 2). The generic and NEON backends each had four
near-identical loops for add/sub/mul/div. Collapse them into a single
transform_binary per backend that takes the operation as an injected
functor.

In the generic backend, the four ops now pass std::plus / std::minus /
std::multiplies / std::divides into transform_binary. In the NEON
backend they pass vec_add / vec_sub / vec_mul / vec_div wrappers
around neon_alias, and std::invocable<VecOp, vec_t, vec_t> routes
types without a matching vector overload (e.g. int64 for vmulq) to
the scalar path at compile time. This replaces the ad-hoc
vector_lane > 2 and is_floating_point_v guards.

Bugs fixed along the way:

- Sub-lane UB in NEON: `ptr <= dest_end - N_lane` formed a pointer
  before the buffer when the input was shorter than one SIMD lane.
  The vector loop now runs a counted trip count
  (`blocks = (dest_end - dest) / N_lane`), which is UB-safe on
  sub-lane inputs and lowers to a `subs/b.ne` macro-op-fused back-edge
  on AArch64. The scalar remainder is inline instead of a recursive
  call into generic::. The same rewrite is applied to check_between.
- check_between diagnostic: the SIMD body checked the >= max mask
  first and only looked at < min if the first was empty, so a later
  too-large lane could hide an earlier too-small one. Both bounds are
  now inspected before picking the returned pointer.
- has_vectype: declared as size_t; retyped to bool to match its
  predicate role.

tests/test_buffer_simd.py pins _simd_feature() == "NEON" on aarch64
so a silent fallback to the scalar path cannot pass unnoticed. It
covers the int32 shape matrix (n=1, 3, 4, 5, 8, 17) for
transform_binary, the int64-mul SFINAE fallback, and float
sub/mul/div with one block + tail. A new private
modmesh._modmesh._simd_feature() binding exposes the runtime-detected
backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tigercosmos
Collaborator Author

@yungyuc All comments are addressed. Please take a look, thanks!

@yungyuc yungyuc left a comment

LGTM



* Ignore:
  * readability-static-definition-in-anonymous-namespace
  * misc-use-anonymous-namespace
* Add missing const
* Use auto to avoid duplicate type name
@yungyuc yungyuc merged commit 3142aec into solvcon:master May 6, 2026
14 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in tabular data processing May 6, 2026

Labels

array Multi-dimensional array implementation performance Profiling, runtime, and memory consumption

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants